在嘈杂的互联网规模数据集上进行了预测,已对具有广泛的文本,图像和其他模式能力的培训模型进行了大量研究。但是,对于许多顺序决策域,例如机器人技术,视频游戏和计算机使用,公开可用的数据不包含以相同方式训练行为先验所需的标签。我们通过半监督的模仿学习将互联网规模的预处理扩展到顺序的决策域,其中代理通过观看在线未标记的视频来学习行动。具体而言,我们表明,使用少量标记的数据,我们可以训练一个足够准确的反向动力学模型,可以标记一个巨大的未标记在线数据来源 - 在这里,在线播放Minecraft的在线视频 - 然后我们可以从中训练一般行为先验。尽管使用了本地人类界面(鼠标和键盘为20Hz),但我们表明,这种行为先验具有非平凡的零射击功能,并且可以通过模仿学习和加强学习,可以对其进行微调,以进行硬探索任务。不可能通过增强学习从头开始学习。对于许多任务,我们的模型都表现出人类水平的性能,我们是第一个报告可以制作钻石工具的计算机代理,这些工具可以花费超过20分钟(24,000个环境动作)的游戏玩法来实现。
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its capabilities by better modeling spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations.
translated by 谷歌翻译
Compact and accurate representations of 3D shapes are central to many perception and robotics tasks. State-of-the-art learning-based methods can reconstruct single objects but scale poorly to large datasets. We present a novel recursive implicit representation to efficiently and accurately encode large datasets of complex 3D shapes by recursively traversing an implicit octree in latent space. Our implicit Recursive Octree Auto-Decoder (ROAD) learns a hierarchically structured latent space enabling state-of-the-art reconstruction results at a compression ratio above 99%. We also propose an efficient curriculum learning scheme that naturally exploits the coarse-to-fine properties of the underlying octree spatial representation. We explore the scaling law relating latent space dimension, dataset size, and reconstruction accuracy, showing that increasing the latent space dimension is enough to scale to large shape datasets. Finally, we show that our learned latent space encodes a coarse-to-fine hierarchical structure yielding reusable latents across different levels of details, and we provide qualitative evidence of generalization to novel shapes outside the training set.
translated by 谷歌翻译
Although deep networks have shown vulnerability to evasion attacks, such attacks have usually unrealistic requirements. Recent literature discussed the possibility to remove or not some of these requirements. This paper contributes to this literature by introducing a carpet-bombing patch attack which has almost no requirement. Targeting the feature representations, this patch attack does not require knowing the network task. This attack decreases accuracy on Imagenet, mAP on Pascal Voc, and IoU on Cityscapes without being aware that the underlying tasks involved classification, detection or semantic segmentation, respectively. Beyond the potential safety issues raised by this attack, the impact of the carpet-bombing attack highlights some interesting property of deep network layer dynamic.
translated by 谷歌翻译
Machine Learning models capable of handling the large datasets collected in the financial world can often become black boxes expensive to run. The quantum computing paradigm suggests new optimization techniques, that combined with classical algorithms, may deliver competitive, faster and more interpretable models. In this work we propose a quantum-enhanced machine learning solution for the prediction of credit rating downgrades, also known as fallen-angels forecasting in the financial risk management field. We implement this solution on a neutral atom Quantum Processing Unit with up to 60 qubits on a real-life dataset. We report competitive performances against the state-of-the-art Random Forest benchmark whilst our model achieves better interpretability and comparable training times. We examine how to improve performance in the near-term validating our ideas with Tensor Networks-based numerical simulations.
translated by 谷歌翻译
Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide canonicalized object instances that are consistently aligned for their 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net directly learns from continuous and noisy radiance fields using a Siamese network architecture that is designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose, and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods.
translated by 谷歌翻译
Extensible objects form a challenging case for NRSfM, owing to the lack of a sufficiently constrained extensible model of the point-cloud. We tackle the challenge by proposing 1) convex relaxations of the isometric model up to quasi-isometry, and 2) convex relaxations involving the equiareal deformation model, which preserves local area and has not been used in NRSfM. The equiareal model is appealing because it is physically plausible and widely applicable. However, it has two main difficulties: first, when used on its own, it is ambiguous, and second, it involves quartic, hence highly nonconvex, constraints. Our approach handles the first difficulty by mixing the equiareal with the isometric model and the second difficulty by new convex relaxations. We validate our methods on multiple real and synthetic data, including well-known benchmarks.
translated by 谷歌翻译
A wide range of techniques have been proposed in recent years for designing neural networks for 3D data that are equivariant under rotation and translation of the input. Most approaches for equivariance under the Euclidean group $\mathrm{SE}(3)$ of rotations and translations fall within one of the two major categories. The first category consists of methods that use $\mathrm{SE}(3)$-convolution which generalizes classical $\mathbb{R}^3$-convolution on signals over $\mathrm{SE}(3)$. Alternatively, it is possible to use \textit{steerable convolution} which achieves $\mathrm{SE}(3)$-equivariance by imposing constraints on $\mathbb{R}^3$-convolution of tensor fields. It is known by specialists in the field that the two approaches are equivalent, with steerable convolution being the Fourier transform of $\mathrm{SE}(3)$ convolution. Unfortunately, these results are not widely known and moreover the exact relations between deep learning architectures built upon these two approaches have not been precisely described in the literature on equivariant deep learning. In this work we provide an in-depth analysis of both methods and their equivalence and relate the two constructions to multiview convolutional networks. Furthermore, we provide theoretical justifications of separability of $\mathrm{SE}(3)$ group convolution, which explain the applicability and success of some recent approaches. Finally, we express different methods using a single coherent formalism and provide explicit formulas that relate the kernels learned by different methods. In this way, our work helps to unify different previously-proposed techniques for achieving roto-translational equivariance, and helps to shed light on both the utility and precise differences between various alternatives. We also derive new TFN non-linearities from our equivalence principle and test them on practical benchmark datasets.
translated by 谷歌翻译
We show that for a plane imaged by an endoscope the specular isophotes are concentric circles on the scene plane, which appear as nested ellipses in the image. We show that these ellipses can be detected and used to estimate the plane's normal direction, forming a normal reconstruction method, which we validate on simulated data. In practice, the anatomical surfaces visible in endoscopic images are locally planar. We use our method to show that the surface normal can thus be reconstructed for each of the numerous specularities typically visible on moist tissues. We show results on laparoscopic and colonoscopic images.
translated by 谷歌翻译